For this project we will attempt to use KMeans clustering to cluster universities into two groups, Private and Public.
It is very important to note that we actually have the labels for this data set, but we will NOT use them for the KMeans clustering algorithm, since that is an unsupervised learning algorithm.
Under normal circumstances you use the KMeans algorithm because you don't have labels. In this case we will use the labels to get an idea of how well the algorithm performed, but you usually won't be able to do this for KMeans, so the classification report and confusion matrix at the end of this project don't truly make sense in a real-world setting!
We will use a data frame with 777 observations on the following 18 variables.
In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Read in the College_Data file using read_csv. Figure out how to set the first column as the index.
In [104]:
df = pd.read_csv('College_Data',index_col=0)
Check the head of the data
In [105]:
df.head()
Out[105]:
Check the info() and describe() methods on the data.
In [106]:
df.info()
In [107]:
df.describe()
Out[107]:
In [111]:
sns.set_style('whitegrid')
# Scatter plot of Grad.Rate versus Room.Board, colored by the Private column
sns.lmplot(x='Room.Board',y='Grad.Rate',data=df, hue='Private',
           palette='coolwarm',height=6,aspect=1,fit_reg=False)
Out[111]:
Create a scatterplot of F.Undergrad versus Outstate where the points are colored by the Private column.
In [112]:
sns.set_style('whitegrid')
sns.lmplot(x='Outstate',y='F.Undergrad',data=df, hue='Private',
           palette='coolwarm',height=6,aspect=1,fit_reg=False)
Out[112]:
Create a stacked histogram showing Out of State Tuition based on the Private column. Try doing this using sns.FacetGrid. If that is too tricky, see if you can do it just by using two instances of pandas.plot(kind='hist').
In [109]:
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'Outstate',bins=20,alpha=0.7)
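The pandas-only alternative mentioned above could look roughly like the following. This is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions, not rows from College_Data): one `plot(kind='hist')` call per group, both drawn on the same axes.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for the Private and Outstate columns.
toy = pd.DataFrame({
    'Private': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Outstate': [12000, 15000, 18000, 6000, 7500, 9000],
})

fig, ax = plt.subplots(figsize=(12, 6))

# One histogram call per group, overlaid on the same axes.
toy[toy['Private'] == 'Yes']['Outstate'].plot(
    kind='hist', bins=5, alpha=0.7, ax=ax, label='Private')
toy[toy['Private'] == 'No']['Outstate'].plot(
    kind='hist', bins=5, alpha=0.7, ax=ax, label='Public')

ax.set_xlabel('Outstate')
ax.legend()
```

With the real data, `toy` would be replaced by `df`; the two groups are overlaid with transparency rather than strictly stacked.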
Create a similar histogram for the Grad.Rate column.
In [110]:
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7)
Notice how there seems to be a private school with a graduation rate higher than 100%. What is the name of that school?
In [113]:
df[df['Grad.Rate'] > 100]
Out[113]:
Set that school's graduation rate to 100 so it makes sense. You may get a warning (not an error) when doing this operation, so use dataframe operations or just re-do the histogram visualization to make sure it actually went through.
In [93]:
# A single .loc call (row label, column label) assigns on df itself, unlike chained indexing
df.loc['Cazenovia College','Grad.Rate'] = 100
In [94]:
df[df['Grad.Rate'] > 100]
Out[94]:
In [95]:
sns.set_style('darkgrid')
g = sns.FacetGrid(df,hue="Private",palette='coolwarm',height=6,aspect=2)
g = g.map(plt.hist,'Grad.Rate',bins=20,alpha=0.7)
In [114]:
from sklearn.cluster import KMeans
Create an instance of a K Means model with 2 clusters.
In [115]:
kmeans = KMeans(n_clusters=2)
Fit the model to all the data except for the Private label.
In [116]:
kmeans.fit(df.drop('Private',axis=1))
Out[116]:
What are the cluster center vectors?
In [117]:
kmeans.cluster_centers_
Out[117]:
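The raw array of centers is easier to read when each coordinate is paired with its feature name. A minimal sketch, assuming a made-up two-column frame named `toy` (not the College_Data features), of wrapping the centers in a DataFrame:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical toy features; with the real data this would be df.drop('Private',axis=1).
toy = pd.DataFrame({
    'Outstate': [6000, 7000, 15000, 16000],
    'Grad.Rate': [50, 55, 80, 85],
})

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# One row per cluster, one column per feature the model was fitted on.
centers = pd.DataFrame(model.cluster_centers_, columns=toy.columns)
print(centers)
```

Each row is the mean of the points assigned to that cluster, so the columns can be compared directly against `df.describe()`.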
There is no perfect way to evaluate clustering if you don't have the labels. However, since this is just an exercise, we do have the labels, so we take advantage of this to evaluate our clusters. Keep in mind that you usually won't have this luxury in the real world.
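When labels really are unavailable, one commonly used (and imperfect) option is an internal metric such as the silhouette score, which only looks at the features and the cluster assignments. The following is a minimal sketch on synthetic blobs; the data, the `random_state` values, and `n_init=10` are assumptions chosen for illustration, not part of this project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two made-up, well-separated groups of 2-D points.
X = np.vstack([
    rng.normal(loc=0, scale=1, size=(50, 2)),
    rng.normal(loc=10, scale=1, size=(50, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Ranges from -1 to 1; values closer to 1 suggest compact, well-separated clusters.
score = silhouette_score(X, model.labels_)
print(score)
```

A high score says the clusters are geometrically tidy, not that they match any real-world category such as Private versus Public.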
Create a new column for df called 'Cluster', which is a 1 for a Private school, and a 0 for a public school.
In [118]:
def converter(cluster):
    # Map the Private column's 'Yes'/'No' strings to 1/0
    if cluster=='Yes':
        return 1
    else:
        return 0
In [119]:
df['Cluster'] = df['Private'].apply(converter)
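The same 'Yes'/'No' to 1/0 mapping can also be written without a helper function. A minimal sketch on a made-up frame (`toy` and its index labels are assumptions), comparing the column to 'Yes' and casting the booleans to integers:

```python
import pandas as pd

# Hypothetical toy frame with the same kind of Private column.
toy = pd.DataFrame({'Private': ['Yes', 'No', 'Yes']}, index=['A', 'B', 'C'])

# True/False from the comparison becomes 1/0 after the cast.
toy['Cluster'] = (toy['Private'] == 'Yes').astype(int)
print(toy)
```

Both approaches give the same column; the vectorized form simply avoids a Python-level call per row.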
In [122]:
df.head()
Out[122]:
Create a confusion matrix and classification report to see how well the KMeans clustering worked without being given any labels.
In [123]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(df['Cluster'],kmeans.labels_))
print(classification_report(df['Cluster'],kmeans.labels_))
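One caveat when reading these numbers: KMeans numbers its clusters arbitrarily, so cluster 1 is not guaranteed to be the "private" cluster, and the report can look far worse (or better) than the clustering really is. A minimal sketch of checking both mappings in the two-cluster case follows; `align_binary_labels` is a hypothetical helper written for this illustration, not a scikit-learn function, and the label arrays are made up.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def align_binary_labels(y_true, y_pred):
    """Flip 0/1 cluster ids when the flipped ids agree better with y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    agreement = np.mean(y_true == y_pred)
    return 1 - y_pred if agreement < 0.5 else y_pred

# Made-up example: the grouping is mostly right, but the ids are swapped.
true_labels = np.array([1, 1, 1, 0, 0, 0])
cluster_ids = np.array([0, 0, 1, 1, 1, 1])

aligned = align_binary_labels(true_labels, cluster_ids)
print(confusion_matrix(true_labels, aligned))
```

With the real data, `df['Cluster']` and `kmeans.labels_` would take the place of the two toy arrays. Note that this uses the true labels, so it is only an aid for reading the report in an exercise like this one.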